The article presents the GitHub repository for "generalized-kmeans-clustering," a production-ready K-Means clustering library for Apache Spark. It offers pluggable Bregman divergences, advanced K-Means variants, and a modern DataFrame API that integrates with Spark ML. The repository is extensively tested and supports a range of distance functions suited to different data types.
clustering
apache spark
k-means
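
To make the idea of pluggable Bregman divergences concrete, here is a minimal sketch in plain Python. It is not the repository's API and does not use Spark; the names `bregman_kmeans`, `squared_euclidean`, and `generalized_kl` are hypothetical and chosen for illustration only. It relies on a known property of Bregman divergences: the arithmetic mean of a cluster minimizes the total divergence from its points to the center, so only the assignment step depends on the chosen divergence.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Divergence = Callable[[Vector, Vector], float]


def squared_euclidean(x: Vector, c: Vector) -> float:
    """Bregman divergence generated by the squared L2 norm."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def generalized_kl(x: Vector, c: Vector) -> float:
    """Generalized I-divergence; assumes strictly positive entries."""
    return sum(xi * math.log(xi / ci) - xi + ci for xi, ci in zip(x, c))


def mean(points: List[Vector]) -> List[float]:
    """Component-wise arithmetic mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def bregman_kmeans(
    points: List[Vector],
    init_centers: List[Vector],
    divergence: Divergence,
    max_iter: int = 20,
) -> Tuple[List[List[float]], List[int]]:
    """Lloyd-style iteration with a pluggable divergence (illustrative only)."""
    centers = [list(c) for c in init_centers]
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: the only place where the divergence is used.
        new_assignments = [
            min(range(len(centers)), key=lambda j: divergence(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: the mean is the optimal center for any Bregman divergence.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return centers, assignments


# Toy data (made up for this example): two well-separated groups.
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
init = [[1.0, 1.0], [5.0, 5.0]]

centers_euclid, labels_euclid = bregman_kmeans(points, init, squared_euclidean)
centers_kl, labels_kl = bregman_kmeans(points, init, generalized_kl)
```

Swapping `squared_euclidean` for `generalized_kl` changes how points are assigned but leaves the update step untouched. This is the kind of separation a pluggable-divergence design allows, though how the repository itself structures it is not described in the summary above.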